\ifx \allfiles \undefined
\documentclass{article}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{graphicx}

\begin{document}
\title{Problem Formulation}
\maketitle \else \fi

\newcommand{\following}{\ensuremath{\mathit{following}}}
\newcommand{\follower}{\ensuremath{\mathit{follower}}}

\section{Problem Formulation}\label{sec:problem}
We model the Twitter social network as a directed graph $G = (V,E)$.
Each node $u \in V = \{1, 2, \ldots, n\}$ represents a user, and a directed edge $(u,v) \in E$ indicates that
user $u$ follows user $v$. The set of users who follow user $u$ is denoted by $\follower(u)$, and the set of users followed by $u$ is denoted by $\following(u)$. The sizes of these two sets are $|\follower(u)|$ and $|\following(u)|$, respectively.
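Equivalently, in set notation (a restatement of the definitions above that introduces no new assumptions),
\[
  \following(u) = \{\, v \in V : (u,v) \in E \,\}, \qquad
  \follower(u) = \{\, v \in V : (v,u) \in E \,\},
\]
so that $|\following(u)|$ and $|\follower(u)|$ are the out-degree and in-degree of $u$ in $G$, respectively.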

%For each user $u$, let the set of tweets published by $u$ be $T_u = \{t_1, t_2, \cdots\}$, where each tweet $t_i \in T_u$ contains a set of words from $W = \{w_1, \cdots, w_m\}$. Then $T_u$ can be viewed as a bag of words from $W$. Let $\phi_u$ be the word occurrence vector in $T_u$, where $\phi_{u, i}$ equals to 1 if word $w_i$ appears in $T_u$, otherwise it equals to 0.
%Here $\phi_{u, i} = 1$ indicates that word $w_i$ appears in $u$'s tweets,
%and $\phi_{u, i} = 0$ indicates that word $w_i$ does not appear in $u$'s tweets.
%To simplify our problem, we ignore the number of occurrences of each word $w_i$.

%Each user $u$'s short bio is also separated into a set of keywords $\{z_1, z_2, \cdots\}$. For example, the keywords set for ``computer science student at UCLA'' is \{``computer science'', ``student'', ``UCLA''\}. The candidate keywords set is manually chosen, and then the short bio is separated based on the candidate keywords set.

%In addition to a user's short bio,
%we also collected a series of characteristic keywords
%$z_1, z_2, \cdots$ from his short bio. For example, a student study computer
%science in UCLA may write ``Computer science student at UCLA'' in his short bio.
%We extract his characteristic profile as
%$z_1=$Computer Science, $z_2$=student, $z_3=$UCLA.
%Because computer could not generate these characteristic profile automatically
%and accurately, we choose some phrases manually and consider other words in
%profile as single keyword for the user.

Given a category $\mathcal{C}$, we want to identify all the users that belong to $\mathcal{C}$. For example, if the category is ``people in UCLA'', the task is to identify all the users in UCLA. The user set $V$ is partitioned into two disjoint sets $\mathcal{A}$ and $\mathcal{B}$ based on some prior knowledge. Users in $\mathcal{A}$ almost certainly belong to $\mathcal{C}$, whereas the membership of users in $\mathcal{B}$ is unknown. For example, users in $V$ can be partitioned according to whether the keyword ``UCLA'' appears in a user's biography. Users in $\mathcal{A}$ are then highly likely to be UCLA-related. However, a user in $\mathcal{B}$ may still belong to $\mathcal{C}$ even without ``UCLA'' in the biography. For instance, a student who writes ``loving being a BRUIN and loving God!'' in the biography falls into set $\mathcal{B}$ but is actually a UCLA student.
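For the keyword example, the partition can be sketched as follows. Here $b_u$ is a hypothetical symbol, not part of the notation above, standing for the set of words in user $u$'s biography:
\[
  \mathcal{A} = \{\, u \in V : \mbox{``UCLA''} \in b_u \,\}, \qquad
  \mathcal{B} = V \setminus \mathcal{A},
\]
so that $\mathcal{A} \cup \mathcal{B} = V$ and $\mathcal{A} \cap \mathcal{B} = \emptyset$. Other forms of prior knowledge would give a different membership test for $\mathcal{A}$, but the same two-set structure.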

%we might find a keyword $z_{\mathcal{C}}$ which best describes this category. For example, keyword ``UCLA'' might represent the category of people in UCLA. Users in $V$ can be partitioned into two sets $\mathcal{A}$ and $\mathcal{B}$ based on whether $z_{\mathcal{C}}$ appears in a user's short bio or not. It is clear that the users in $\mathcal{A}$ are highly likely to belong to $\mathcal{C}$. However, a user $u \in B$ may still belong to $\mathcal{C}$ even if he does not have $z_{\mathcal{C}}$ is his short bio. For example, a student may write ``loving being a BRUIN and loving God!'' in his short bio. Although he belongs to set $\mathcal{B}$ if the keyword is ``UCLA'', he is actually a ``UCLA'' student.
%Given a keyword or phrase $z_i$, we call users which write $z_i$ in their
%short bio directly as nodes $L={l_1, l_2, \cdots, l_n}$,
%while we call other users without the $p_i$ as nodes $D={d_1, d_2, \cdots, d_m}$.
%The total user set we have is $C=D \cup L$.
%Although some users in nodes set $D$ have not write $p_i$ in their profile,
%he actually owns the keyword or phrase in real life.
%As the example before, a student in UCLA could also write
%``loving being a BRUIN and loving God!'' in her short bio.

Our problem is to predict how strongly a user $u \in \mathcal{B}$ is related to the category $\mathcal{C}$, given the partition of $V$ into $\mathcal{A}$ and $\mathcal{B}$. We focus on the scenario where the prior knowledge is a keyword $z_{\mathcal{C}}$ that describes the category $\mathcal{C}$. Each user $u$ is associated with a relevance score $s_u$, which represents the likelihood that $u$ belongs to $\mathcal{C}$. The higher the value of $s_u$, the more likely it is that $u$ belongs to $\mathcal{C}$. The result is a ranking of the users in $\mathcal{B}$ based on the value of $s_u$.
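One way to write this output, as a sketch rather than a fixed part of the formulation, is as an ordering of $\mathcal{B}$ by decreasing score. The symbols $m$ and $\pi$ are introduced here only for illustration: let $m = |\mathcal{B}|$ and let $\pi : \{1, \ldots, m\} \rightarrow \mathcal{B}$ be a bijection such that
\[
  s_{\pi(1)} \ge s_{\pi(2)} \ge \cdots \ge s_{\pi(m)}.
\]
The user $\pi(1)$ is then the one judged most likely to belong to $\mathcal{C}$. How ties between equal scores are broken is left unspecified here.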

%Now our task is to design some algorithms that for a specify keyword or
%phrase $z_i$, predict how user $u$ in nodes set $D$ is related with $z_i$.
%In order to achieve this goal, we may assign some weight on words appeared
%in user's tweets, short bio, or locations. Additionally, we may analysis
%user's link relationship with others.
%Our result would be a sorted list that ranks the user with high probability
%having $z_i$ in real life at first.

\ifx \allfiles \undefined
\end{document}
\fi
